System and method for converting audio-to-text with delay

ABSTRACT

Described herein is a system and method for generating text caption information for an audio-video (AV) signal, the system and method comprising: receiving an AV signal; extracting audio from the AV signal to form an extracted audio signal; time stamping both the extracted audio signal and the received AV signal; partitioning the extracted audio signal into a first predetermined duration segment of extracted audio signal; generating text captions from the partitioned extracted audio signal over a first duration, and converting the same to a video text signal, with the same time stamp as the extracted audio signal and received AV signal; delaying the received AV signal by an amount of time substantially similar to the first duration; combining the time stamped video text signal and the delayed time stamped received AV signal based on the time stamps; and outputting the combined time stamped video text signal and the time stamped received AV signal to a display.

PRIORITY INFORMATION

The present application claims priority under 35 U.S.C. § 119(e) to U.S. Provisional Pat. Application Serial No. 63/282,320, filed Nov. 23, 2021, the entire contents of which are expressly incorporated herein by reference.

BACKGROUND OF THE INVENTION

Technical Field

The embodiments described herein relate generally to audio systems, and more particularly to systems, methods, and modes for alleviating the problems of delays between video and live captions for deaf and/or hard of hearing people.

Background Art

Oftentimes, when using live captions for deaf or hard of hearing people, there is a noticeable delay or disconnect between the video and the transcribed audio. This noticeable disconnect or delay can be of little consequence if two people are merely talking to each other in the video, but if other video is displayed, such as sports or video that shows events happening, then the delay between the video and the live captions can be very disconcerting.

Accordingly, a need has arisen for systems, methods, and modes for alleviating the problems of delays between video and live captions for deaf and/or hard of hearing people.

SUMMARY

It is an object of the embodiments to substantially solve at least the problems and/or disadvantages discussed above, and to provide at least one or more of the advantages described below.

It is therefore a general aspect of the embodiments to provide systems, methods, and modes for alleviating the problems of delays between video and live captions for deaf and/or hard of hearing people that will obviate or minimize problems of the type previously described.

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

Further features and advantages of the aspects of the embodiments, as well as the structure and operation of the various embodiments, are described in detail below with reference to the accompanying drawings. It is noted that the aspects of the embodiments are not limited to the specific embodiments described herein. Such embodiments are presented herein for illustrative purposes only. Additional embodiments will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein.

According to a first aspect of the embodiments, a method for generating text caption information for an audio-video (AV) signal is provided, the method comprising: receiving an AV signal; extracting audio from the AV signal to form an extracted audio signal; time stamping both the extracted audio signal and the received AV signal; partitioning the extracted audio signal into a first predetermined duration segment of extracted audio signal; generating text captions from the partitioned extracted audio signal over a first duration, and converting the same to a video text signal, with the same time stamp as the extracted audio signal and received AV signal; delaying the received AV signal by an amount of time substantially similar to the first duration; combining the time stamped video text signal and the delayed time stamped received AV signal based on the time stamps; and outputting the combined time stamped video text signal and the time stamped received AV signal to a display.

According to the first aspect of the embodiments, the step of generating text captions from the partitioned extracted audio signal over a first duration further comprises: comparing the generated text captions with a list of text obtained by a source of the AV signal to improve accuracy of the generated text captions.

According to the first aspect of the embodiments, the list of text obtained by the source of the AV signal comprises text associated with the subject matter of the AV signal.

According to the first aspect of the embodiments, the step of generating text captions from the partitioned extracted audio signal over a first duration further comprises: obtaining metadata from the AV signal; generating a list of text that substantially matches the subject matter of the AV signal based on the obtained metadata; comparing the generated text captions with the generated list of text based on the obtained metadata to improve accuracy of the generated text captions.

According to the first aspect of the embodiments, the step of generating text captions from the partitioned extracted audio signal over a first duration further comprises: using artificial intelligence programming techniques to develop a list of text that substantially matches the subject matter of the AV signal based on the obtained metadata; comparing the generated text captions with the AI developed list of text to improve accuracy of the generated text captions.

According to the first aspect of the embodiments, the AI programming techniques comprise: Recurrent Neural Networks that are trained to suppress non-voice audio resulting in significantly improved voice signal-to-noise ratio (SNR) and clarity.

According to a second aspect of the embodiments, a system for generating text caption information for an audio-video (AV) signal is provided, the system comprising: an audio-video (AV) signal receiver; at least one processor that is part of the AV signal receiver; a memory operatively connected with the at least one processor, wherein the memory stores computer-executable instructions that, when executed by the at least one processor, cause the at least one processor to execute a method that comprises: receiving an AV signal at the AV signal receiver; extracting audio from the AV signal to form an extracted audio signal; time stamping both the extracted audio signal and the received AV signal; partitioning the extracted audio signal into a first predetermined duration segment of extracted audio signal; generating text captions from the partitioned extracted audio signal over a first duration, and converting the same to a video text signal, with the same time stamp as the extracted audio signal and received AV signal; delaying the received AV signal by an amount of time substantially similar to the first duration; combining the time stamped video text signal and the delayed time stamped received AV signal based on the time stamps; and outputting the combined time stamped video text signal and the time stamped received AV signal to a display.

According to the second aspect of the embodiments, the step of generating text captions from the partitioned extracted audio signal over a first duration further comprises: comparing the generated text captions with a list of text obtained by a source of the AV signal to improve accuracy of the generated text captions.

According to the second aspect of the embodiments, the list of text obtained by the source of the AV signal comprises text associated with the subject matter of the AV signal.

According to the second aspect of the embodiments, the step of generating text captions from the partitioned extracted audio signal over a first duration further comprises: obtaining metadata from the AV signal; generating a list of text that substantially matches the subject matter of the AV signal based on the obtained metadata; comparing the generated text captions with the generated list of text based on the obtained metadata to improve accuracy of the generated text captions.

According to the second aspect of the embodiments, the step of generating text captions from the partitioned extracted audio signal over a first duration further comprises: using artificial intelligence programming techniques to develop a list of text that substantially matches the subject matter of the AV signal based on the obtained metadata; comparing the generated text captions with the AI developed list of text to improve accuracy of the generated text captions.

According to the second aspect of the embodiments, the AI programming techniques comprise: Recurrent Neural Networks that are trained to suppress non-voice audio resulting in significantly improved voice signal-to-noise ratio (SNR) and clarity.

According to the second aspect of the embodiments, the AV signal receiver, at least one processor and memory are part of an audio video display device.

BRIEF DESCRIPTION OF THE DRAWINGS

The above and other objects and features of the embodiments will become apparent and more readily appreciated from the following description of the embodiments with reference to the following figures. Different aspects of the embodiments are illustrated in reference figures of the drawings. It is intended that the embodiments and figures disclosed herein are to be considered to be illustrative rather than limiting. The components in the drawings are not necessarily drawn to scale, emphasis instead being placed upon clearly illustrating the principles of the aspects of the embodiments. In the drawings, like reference numerals designate corresponding parts throughout the several views.

FIG. 1 illustrates a functional block diagram of an audio-to-text conversion and audio-video signal delay circuit for use in an audio-video playback device or system, according to aspects of the embodiments.

FIG. 2 illustrates a flow chart of a method for converting audio-to-text and adding delay to the audio-video signal using the audio-to-text conversion and audio-video signal delay circuit shown in FIG. 1 according to aspects of the embodiments.

FIG. 3 illustrates a block diagram of the major components of a personal computer (PC), server, laptop, personal electronic device (PED), personal digital assistant (PDA), tablet (e.g., iPad), or any other computer/processor (hereinafter, “processing device”) suitable for use to implement the method shown in FIG. 2 for converting audio-to-text and adding delay to the audio-video signal using the audio-to-text conversion and audio-video signal delay circuit shown in FIG. 1 according to aspects of the embodiments.

FIG. 4 illustrates a network system within which the system and method for substantially automatically converting audio-to-text with a delay using the audio-to-text conversion and audio-video signal delay circuit shown in FIG. 1 can be implemented according to aspects of the embodiments.

DETAILED DESCRIPTION

The embodiments are described more fully hereinafter with reference to the accompanying drawings, in which embodiments of the inventive concept are shown. In the drawings, the size and relative sizes of layers and regions may be exaggerated for clarity. Like numbers refer to like elements throughout. The embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the inventive concept to those skilled in the art. The scope of the embodiments is therefore defined by the appended claims. The detailed description that follows is written from the point of view of a control systems company, so it is to be understood that generally the concepts discussed herein are applicable to various subsystems and not limited to only a particular controlled device or class of devices, such as audio networks, but can be used in virtually any type of audio playback system.

Reference throughout the specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with an embodiment is included in at least one embodiment of the embodiments. Thus, the appearance of the phrases “in one embodiment” or “in an embodiment” in various places throughout the specification is not necessarily referring to the same embodiment. Further, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.

The different aspects of the embodiments described herein pertain to the context of systems, methods, and modes for alleviating the problems of delays between video and live captions for deaf and/or hard of hearing people, but are not limited thereto, except as may be set forth expressly in the appended claims.

For 40 years Crestron Electronics, Inc. has been the world’s leading manufacturer of advanced control and automation systems, innovating technology to simplify and enhance modern lifestyles and businesses. Crestron designs, manufactures, and offers for sale integrated solutions to control audio, video, computer, and environmental systems. In addition, the devices and systems offered by Crestron streamline technology, improving the quality of life in commercial buildings, universities, hotels, hospitals, and homes, among other locations. Accordingly, the systems, methods, and modes described herein can improve audio systems as discussed below.

The systems, methods, and modes described herein substantially alleviate the problems of delays between video and live captions for deaf and/or hard of hearing people.

In the following detailed description, references are made to the accompanying drawings that form a part hereof, and in which are shown, by way of illustration, specific embodiments or examples. These aspects may be combined, other aspects may be utilized, and structural changes may be made without departing from the spirit or scope of the present disclosure. The following detailed description is therefore not to be taken in a limiting sense, and the scope of the present invention is defined by the appended claims and their equivalents.

While some embodiments will be described in the general context of program modules that execute in conjunction with an application program that runs on an operating system on a personal computer, those skilled in the art will recognize that aspects may also be implemented in combination with other program modules.

The following is a list of the elements of the figures in numerical order:

100 Audio Video Delay (AVD) Circuit
102 Audio Extractor Device
104 Audio Video Receiver
106 Caption Generating Device
108 Delay Device
110 Combiner/Re-combiner
112 Clock
114 Processor
116 Memory
118 Audio Video Delay & Captioning Software Application (AVDC App)
120 Audio Video (AV) Display
122 Network
124 Cloud Based Digital Audio Video Sources
126 Other Digital Audio Video Sources
128 Analog Audio Video Sources
130 Analog Audio Video Receiver & Analog-to-Digital Converter Processing
200 Method for Generating Captions for Video and Delaying the Video to Ensure Synchronized Captions and Video
202–210 Steps of Method 200
300 Processing Device
304 Microprocessor Internal Memory
306 Computer Operating System (OS) VGA
308 Internal Data/Command Bus (Bus)
312 Read-Only Memory (ROM)
314 Random Access Memory (RAM)
316 Printed Circuit Board (PCB)
318 Hard Disk Drive (HDD)
320 Universal Serial Bus (USB) Port
322 Ethernet Port
324 Video Graphics Array (VGA) Port or High Definition Multimedia Interface (HDMI)
326 Compact Disk (CD)/Digital Video Disk (DVD) Read/Write (RW) (CD/DVD/RW) Drive
328 Floppy Diskette Drive (FDD)
330 Integrated Display/Touchscreen (Laptop/Tablet etc.)
332 Wi-Fi Transceiver
334 BlueTooth (BT) Transceiver
336 Near Field Communications (NFC) Transceiver
338 Third Generation (3G), Fourth Generation (4G), Fifth Generation (5G), Long Term Evolution (LTE) (3G/4G/5G/LTE) Cellular Transceiver
340 Communications Satellite/Global Positioning System (Satellite) Transceiver
342 Mouse
344 Scanner/Printer/Fax Machine
346 Universal Serial Bus (USB) Cable
348 High Definition Multi-Media Interface (HDMI) Cable
350 Ethernet Cable (CAT5)
352 External Memory Storage Device
354 Flash Drive Memory
356 CD/DVD Diskettes
358 Floppy Diskettes
360 Keyboard
364 Antenna
366 Shell/Box
402 Modulator/Demodulator (Modem)
404 Wireless Router
406 Internet Service Provider (ISP)
408 Server/Switch/Router
410 Internet
412 Cellular Service Provider
414 Cellular Telecommunications Service Tower (Cell Tower)
416 Satellite System Control Station
418 Global Positioning System (GPS) Station
420 Satellite (Communications/GPS)
422 Mobile Electronic Device (MED)/Personal Electronic Device (PED)
424 Plain Old Telephone Service (POTS) Provider
518 Equalizer
520 Amplifier(s)
522 Loudspeaker(s)
524 Microphone (Mic)
526 Digital Input(s)
528 Analog Input(s)

Used throughout the specification are several acronyms, the meanings of which are provided as follows:

3G Third Generation
4G Fourth Generation
5G Fifth Generation
APB NW Audio Playback Network
API Application Programming Interface
App Executable Software Programming Code/Application
ASIC Application Specific Integrated Circuit
BIOS Basic Input/Output System
BT BlueTooth
CD Compact Disk
CRT Cathode Ray Tube
DVD Digital Video Disk
EEPROM Electrically Erasable Programmable Read Only Memory
FDD Floppy Diskette Drive
FPGA Field Programmable Gate Array
GAN Global Area Network
GPS Global Positioning System
GUI Graphical User Interface
HDD Hard Disk Drive
HDMI High Definition Multimedia Interface
ISP Internet Service Provider
LCD Liquid Crystal Display
LED Light Emitting Diode Display
LTE Long Term Evolution
MODEM Modulator-Demodulator
NFC Near Field Communications
OS Operating System
PC Personal Computer
PED Personal Electronic Device
POTS Plain Old Telephone Service
PROM Programmable Read Only Memory
RAM Random Access Memory
ROM Read-Only Memory
RW Read/Write
USB Universal Serial Bus (USB) Port
UV Ultraviolet Light
UVPROM Ultraviolet Light Erasable Programmable Read Only Memory
VGA Video Graphics Array

Generally, program modules include routines, programs, components, data structures, and other types of structures that perform particular tasks or implement particular abstract data types. Moreover, those of skill in the art can appreciate that different aspects of the embodiments can be practiced with other computer system configurations, including hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers, and comparable computing devices. Aspects of the embodiments can also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules can be located in both local and remote memory storage devices.

Aspects of the embodiments can be implemented as a computer-implemented process (method), a computing system, or as an article of manufacture, such as a computer program product or computer readable media. The computer program product can be a computer storage medium readable by a computer system and encoding a computer program that comprises instructions for causing a computer or computing system to perform example process(es). The computer-readable storage medium is a computer-readable memory device. The computer-readable storage medium can for example be implemented via one or more of a volatile computer memory, a non-volatile memory, a hard drive, a flash drive, a floppy disk, or a compact disk, and comparable hardware media.

Throughout this specification, the term “platform” can be a combination of software and hardware components for providing share permissions and organization of content in an application with multiple levels of organizational hierarchy. Examples of platforms include, but are not limited to, a hosted service executed over a plurality of servers, an application executed on a single computing device, and comparable systems. The term “server” generally refers to a computing device executing one or more software programs typically in a networked environment. More detail on these technologies and example operations is provided below.

A computing device, as used herein, refers to a device comprising at least a memory and one or more processors that includes a server, a desktop computer, a laptop computer, a tablet computer, a smart phone, a vehicle mount computer, or a wearable computer. A memory can be a removable or non-removable component of a computing device configured to store one or more instructions to be executed by one or more processors. A processor can be a component of a computing device coupled to a memory and configured to execute programs in conjunction with instructions stored by the memory. Actions or operations described herein may be executed on a single processor, on multiple processors (in a single machine or distributed over multiple machines), or on one or more cores of a multi-core processor. An operating system is a system configured to manage hardware and software components of a computing device that provides common services and applications. An integrated module is a component of an application or service that is integrated within the application or service such that the application or service is configured to execute the component. A computer-readable memory device is a physical computer-readable storage medium implemented via one or more of a volatile computer memory, a non-volatile memory, a hard drive, a flash drive, a floppy disk, or a compact disk, and comparable hardware media that includes instructions thereon to automatically save content to a location. A user experience can be embodied as a visual display associated with an application or service through which a user interacts with the application or service. A user action refers to an interaction between a user and a user experience of an application or a user experience provided by a service that includes one of touch input, gesture input, voice command, eye tracking, gyroscopic input, pen input, mouse input, and keyboard input. An application programming interface (API) can be a set of routines, protocols, and tools for an application or service that allow the application or service to interact or communicate with one or more other applications and services managed by separate entities.

While example implementations are described using audio networks herein, embodiments are not limited to such applications. For example, aspects of the embodiments can be employed in stand-alone audio systems, such as a room in a building that can play audio through a dedicated system not connected to any network, and further can be used with any personal audio/video device. Anytime audio/video is received for viewing by a user, whether in or through a network or not, systems, methods, and modes of the aspects of the embodiments can substantially alleviate the problems of delays between video and live captions for deaf and/or hard of hearing people.

Technical advantages exist for substantially alleviating the problems of delays between video and live captions for deaf and/or hard of hearing people when using the aspects of the embodiments. Such technical advantages can include, but are not limited to, communicating more effectively with a greater number of people.

Aspects of the embodiments address a need that arises from the very large scale of operations created by networked computing and cloud-based services that cannot be managed by humans. The actions/operations described herein are not a mere use of a computer, but address results of a system that is a direct consequence of software used as a service such as audio network communication services offered in conjunction with communications.

FIGS. 1-4 illustrate various aspects of systems, methods, and modes for alleviating the problems of delays between video and live captions for deaf and/or hard of hearing people, and which can be used in an audio network for use on or with one or more computing devices, including, according to certain aspects of the embodiments, use of the internet or other similar networks. Further, such systems, modes, and methods can be used with personal communications devices.

The automatic transcription of audio and then delaying the video such that the video and transcribed audio are substantially aligned provides a practical, technical solution to the problem of transcribed audio that is mismatched in time with its related video. As those of skill in the art can appreciate, the aspects of the embodiments have no “analog equivalent” as the embodiments reside solely or substantially in the physical device or computer domain. That is, substantially automatically and substantially instantaneously transcribing audio and aligning it with related video by delaying the original audio and video can be used with one or more computing devices, including, according to certain aspects of the embodiments, use of the internet or other similar networks. The systems, methods, and modes of the aspects of the embodiments for transcribing audio from an audio/video signal have always meant, and continue to mean, using practical, non-abstract physical devices.

The technological improvement of the aspects of the embodiments resides at least in the ability to quickly and easily alleviate the problems of delays between video and live captions for deaf and/or hard of hearing people by delaying the video while the audio signal is transcribed and then aligning the two within an audio system using sophisticated computer hardware.

FIG. 1 illustrates a functional block diagram of an audio-to-text conversion and audio-video signal delay circuit for use in an audio-video playback device or system, according to aspects of the embodiments.

By way of example, when a Crestron touchscreen communication device is used in a conference call in which one participant is hearing impaired, the audio, or the audio and video, can be delayed by up to 500 milliseconds so that the audio portion can be processed through a voice recognition system and an audio-to-text conversion system, and the resulting text can be displayed for the benefit of the hearing impaired person. Security features can also be included, such as encryption and a list of authorized recipients, among others. In addition, such a system can be implemented on cell phones, and practically any type of personal communication device.

FIG. 1 illustrates a functional block diagram of audio-to-text conversion and audio-video signal delay circuit (audio-video delay (AVD) circuit) 100 for use in an audio-video playback device or system, such as a personal communication device (e.g., a phone, laptop, or any other personal electronic device (PED)), according to aspects of the embodiments.

According to aspects of the embodiments, AVD circuit 100 implements steps for receiving an audio-video signal, extracting audio from the audio-video signal, time stamping both the extracted audio and the audio-video signal, generating captions from the extracted audio and converting the text to a video text signal, delaying the video for a duration substantially equal to the time it takes to generate the captions from the extracted audio to ensure that the captions are substantially synchronized with the video when recombined, recombining the video text signal and the delayed audio-video signal based on their respective time stamps, and displaying the recombined audio-video signal and video text signal.
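By way of non-limiting illustration only, the partitioning and time stamping of the extracted audio can be sketched in software as follows. The sketch is written in Python; the names AudioSegment and partition_audio, the use of a list of floating point samples, and the choice of seconds as the time stamp unit are assumptions made for illustration and are not taken from the figures.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AudioSegment:
    """One fixed-duration piece of extracted audio (hypothetical structure)."""
    start_ts: float        # time stamp, in seconds, of the first sample
    samples: List[float]   # raw audio samples belonging to this segment


def partition_audio(samples: List[float], sample_rate: int,
                    segment_seconds: float,
                    start_ts: float = 0.0) -> List[AudioSegment]:
    """Split extracted audio into time-stamped segments of a set duration.

    The last segment may be shorter than segment_seconds if the audio
    does not divide evenly.
    """
    per_segment = int(sample_rate * segment_seconds)
    if per_segment <= 0:
        raise ValueError("segment must contain at least one sample")
    segments = []
    for offset in range(0, len(samples), per_segment):
        # The time stamp is derived from the sample offset so that each
        # segment can later be matched against the time-stamped AV signal.
        ts = start_ts + offset / sample_rate
        segments.append(AudioSegment(ts, samples[offset:offset + per_segment]))
    return segments
```

In this sketch each segment carries the time stamp of its first sample, so that captions generated from a segment can inherit that time stamp.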

As a person of ordinary skill in the art (POSITA) will be able to appreciate following this discussion, the functional block diagram components shown in FIG. 1 can, in general, be implemented as either software components, hardware components, or any combination thereof. Furthermore, in the following discussion, a POSITA can appreciate that any of the signals that are shown and discussed can be, for the most part, in analog form, or digital form, and any combination thereof. In general, as a POSITA can appreciate, the only signals that must be analog are those transmitted to loudspeakers used in AV display 120, discussed below.

AVD circuit 100 comprises audio extractor 102, audio-video (AV) receiver 104, caption generator 106, delay 108, combiner/recombiner 110, clock 112, at least one processor 114, memory 116, audio video delay & captioning software application (AVDC App) 118, AV display 120, network 122, cloud based digital sources of AV signals 124, other digital AV signal sources 126, analog AV signals 128, and analog AV receiver & analog-to-digital converter processing circuitry 130, the last of which converts received analog AV signals to digital AV signals for processing within AVD circuit 100, according to aspects of the embodiments.

AV display 120 comprises one or more currently available displays (e.g., liquid crystal display (LCD) displays, light emitting diode (LED) displays, plasma panel displays, and the like), and can further include one or more AV receivers, audio amplifiers, loudspeakers, digital signal processors (DSPs), and digital to analog converters (DAC), among other analog and/or digital signal processing devices. According to further aspects of the embodiments, AVD circuit 100 can be part of AV display 120, either as a separate hardware/software component, or as an integrated part of the existing circuitry of AV display 120. In this case, both analog and digital signals (124, 126, 128) can be directly received by AV display 120.

For the purposes of this discussion, each block in the block diagram of FIG. 1 will be discussed as if it were a physical circuit or device; however, as discussed above, each block in the diagram of AVD circuit 100 can be implemented in hardware, or as software as part of AVDC App 118, or in any combination thereof. Further, if each block were constructed as a physical device, AVDC App 118 would coordinate operation and signal flow between such physical devices.

In FIG. 1, a combined audio video (AV) signal is received by audio extractor 102 and AV receiver 104 substantially simultaneously. Sources of AV signals include cloud based AV sources 124 that can be transmitted through network 122, as well as other digital AV signals 126 (e.g., digital video disk (DVD) players, and the like), and analog AV signal sources 128, which are received and processed by analog AV receiver & ADC processing circuit 130 to create digital AV signals that are then input to audio extractor 102. Network 122 can be virtually any type of network, including but not limited to a local area network (LAN), global area network (GAN), the internet, among other types of networks. Accessible through network 122 are one or more cloud based digital streaming audio/video sources 124.

The audio component that is present in the received digital AV signal is extracted by audio extractor 102, and the audio is then time stamped in audio extractor 102. The AV signal is time stamped in AV receiver 104. The extracted audio is sent to caption generator 106, wherein the audio transcription process occurs. The original AV signal is delayed in delay device 108; according to aspects of the embodiments, the delay used in delay device 108 needs to be at least as long as the time it takes to caption the audio in caption generator 106. The delay can be longer, but it must be at least that long; otherwise there can be a mismatch between the captioned audio and the video that is displayed. Because all of the signals carry their original time stamp information, they can be recombined with substantially no mismatch.
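By way of non-limiting illustration only, a software form of the delay applied to the time-stamped AV signal can be sketched as a first-in, first-out buffer that releases a frame only after the configured delay has elapsed. The class name DelayLine, its methods, and the representation of a frame as a (time stamp, payload) pair are assumptions made for illustration; a hardware delay device could behave equivalently.

```python
from collections import deque


class DelayLine:
    """Holds time-stamped AV frames and releases them after a fixed delay.

    Hypothetical sketch: delay_seconds is assumed to be chosen at least as
    long as the time needed to caption one audio segment.
    """

    def __init__(self, delay_seconds: float):
        if delay_seconds < 0:
            raise ValueError("delay must be non-negative")
        self.delay_seconds = delay_seconds
        self._frames = deque()  # (timestamp, frame) pairs in arrival order

    def push(self, timestamp: float, frame) -> None:
        """Store a frame together with its original time stamp."""
        self._frames.append((timestamp, frame))

    def pop_ready(self, now: float) -> list:
        """Return every frame whose delay has fully elapsed at time now."""
        ready = []
        while self._frames and self._frames[0][0] + self.delay_seconds <= now:
            ready.append(self._frames.popleft())
        return ready
```

Because each released frame keeps its original time stamp, a downstream combiner can still match it against caption text generated from the same instant.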

As a person of ordinary skill in the art (POSITA) can appreciate, it typically takes about 25% of the video length to generate a suitable caption for the video segment; therefore, if the received AV signal is broken into ten second lengths, it will take about 2.5 seconds to accurately caption the video. According to aspects of the embodiments, metadata that accompanies the video can be used to more accurately caption the video by providing information beforehand regarding the video, and therefore limiting the expected vocabulary or word lists to be used to generate the caption. Artificial intelligence and machine learning programming techniques can be used as well. By way of non-limiting example, if a program received for delayed captioning were directed towards the subject matter of cake making, then if caption generator 106 “heard” or recognized the word “sweet” it would most likely have that word in its metadata list, and not the word “suite” which, of course, refers to rooms. As those of skill in the art can further appreciate, artificial intelligence (AI) programming techniques can be incorporated such that AVDC App 118 and caption generator 106 can review each previous word or phrase and use that information to select, for the word it hears, the word most likely to be used in that particular scenario. Other uses of AI can include AI based noise suppression, which uses various techniques such as Recurrent Neural Networks that are trained to suppress non-voice audio, resulting in significantly improved voice signal-to-noise ratio (SNR) and clarity. Performing this suppression prior to transcription will improve the transcription accuracy. RNNoise is a software implementation of such a capability.
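By way of non-limiting illustration, the “sweet”/“suite” example above can be sketched as a lookup that prefers whichever homophone appears in a word list derived from the metadata. This is a minimal sketch under stated assumptions: the HOMOPHONES table, the pick_word function, and the example word lists are hypothetical and are not taken from the embodiments, and an actual caption generator would likely rely on a far richer language model.

```python
# Hypothetical table of words that sound alike (assumption for illustration).
HOMOPHONES = {
    "sweet": {"suite"},
    "suite": {"sweet"},
}


def pick_word(heard, topic_words):
    """Prefer the homophone that appears in the metadata-derived word list.

    `heard` is the word the recognizer produced; `topic_words` is the
    vocabulary expected for the program's subject matter.
    """
    if heard in topic_words:
        return heard
    for candidate in sorted(HOMOPHONES.get(heard, ())):
        if candidate in topic_words:
            return candidate
    # No better candidate is known, so the recognized word is kept.
    return heard


cake_words = {"cake", "sweet", "frosting", "flour"}
print(pick_word("suite", cake_words))
```

With the cake-making word list shown, a recognized “suite” is replaced by “sweet”, while a word with no listed homophone is returned unchanged.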

Recombination of the captioned audio (a video signal comprising text only) - i.e., the output of caption generator 106 - with the delayed AV signal - i.e., the output of delay 108 - occurs in combiner/recombiner 110 (hereinafter referred to as recombiner 110). Since the original AV signal was delayed as a combined signal, the output of recombiner 110 is a delayed version of the original AV signal - which incurs no “slippage” between audio and the video it is associated with - and the caption text signal, which, because of its time stamp, can be matched to the delayed AV signal such that the caption text substantially matches the video information from which it was originally extracted in audio extractor 102.

AVD circuit 100 can also be referred to as a processing device. A processing device is generally a server, computer, laptop, or the like, and includes at least one display (not shown), keyboard (which can be separate or integrated into the display), mouse, and/or other devices commonly associated with known processor based devices. The processing device includes at least one microprocessor 114, memory 116, and AVDC App 118. AVDC App 118 can also include a portion that generates user interfaces such as graphical user interfaces (GUIs) through which AVD circuit 100 can be managed.

FIG. 2 illustrates a flow chart of method 200 for generating captions for video and delaying the video to ensure substantially synchronized captions and video within AVD circuit 100 according to aspects of the embodiments. Method 200 can be generally performed by AVDC App 118, stored in memory 116, and executed by microprocessor 114, the steps of storing and execution known to a person of ordinary skill in the art. Or, as discussed above, some or all of blocks 102 - 112 can be physical devices and controlled by AVDC App 118 stored in memory 116, and executed by microprocessor 114 according to aspects of the embodiments.

Method 200 begins with method step 202. In fulfillment of the dual purposes of clarity and brevity, the source of the AV signal (digital or analog) is not discussed, as the processing that occurs applies equally to analog sourced signals and digital signals, with the exception of converting analog to digital signals, which has been discussed in detail above in regard to FIG. 1. Thus, method 200 is described with the AV signal received at audio extractor 102 and AV receiver 104 being a digital AV signal. In method step 202 the AV signal is received at audio extractor 102 and AV receiver 104, and is time stamped in both 102 and 104.

In method step 204, the audio portion of the AV signal that was time stamped is extracted by audio extractor 102; according to aspects of the embodiments, the time stamp is still attached to the audio portion.

In method step 206, the time stamped AV signal is received by delay 108, and a delay, Δτ, is added. Substantially simultaneously, the time stamped audio portion of the received AV signal is received by caption generator 106, and captioning of the audio signal begins. According to aspects of the embodiments, a predetermined time length of audio signal is loaded into the caption generator (by way of a non-limiting example, about 10 seconds of audio signal), and text generation occurs in the manner described above. As further described above, it can take about 25% of the duration of the audio signal to generate the caption text; therefore, if the duration of the audio signal is about 10 seconds, and it takes 2.5 seconds to generate the caption text from the audio, the length of the delay Δτ imposed by delay 108 is also about 2.5 seconds. The text that emerges from caption generator 106 is a video signal, but contains text only.
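By way of non-limiting illustration, the partitioning of the audio signal into segments of predetermined length and the resulting delay Δτ can be sketched as follows. This is a minimal sketch only; the function names partition and caption_delay are hypothetical, and the 10 second segment length and 25% ratio are simply the example figures given above, not fixed requirements.

```python
def partition(duration_s, segment_s=10.0):
    """Split an audio signal of `duration_s` seconds into time-stamped
    segments of at most `segment_s` seconds each.

    Returns a list of (start, end) time stamps in seconds.
    """
    segments = []
    start = 0.0
    while start < duration_s:
        end = min(start + segment_s, duration_s)
        segments.append((start, end))
        start = end
    return segments


def caption_delay(segment_s, ratio=0.25):
    """Estimate the delay Δτ as a fraction of the segment duration.

    The 0.25 default reflects the approximate 25% figure stated above.
    """
    return segment_s * ratio


print(partition(25.0))      # [(0.0, 10.0), (10.0, 20.0), (20.0, 25.0)]
print(caption_delay(10.0))  # 2.5
```

For a 10 second segment the sketch yields a Δτ of 2.5 seconds, matching the worked figure in the description.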

In method step 208 both the video text signal from caption generator 106 and the delayed AV signal from delay 108 - each with a time stamp - are output.

In method step 210 the output video text signal from caption generator 106 and the delayed AV signal from delay 108 - each with a time stamp - are received by combiner 110. Combiner 110 verifies that the time stamps are substantially similar and then combines the two signals.
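By way of non-limiting illustration, the time stamp verification performed by combiner 110 can be sketched as a tolerance comparison between the two signals. This is a minimal sketch only; the function name combine, the tuple layouts, and the 0.05 second tolerance standing in for “substantially similar” are assumptions made for illustration and are not specified by the embodiments.

```python
def combine(text_frames, av_frames, tolerance_s=0.05):
    """Pair each delayed AV frame with the caption text whose time stamp
    is within `tolerance_s` seconds of the AV frame's time stamp.

    text_frames: list of (timestamp, caption_text)
    av_frames:   list of (timestamp, av_payload)
    Returns a list of (timestamp, av_payload, caption_text or None).
    """
    combined = []
    for av_ts, av_payload in av_frames:
        match = None
        for text_ts, text in text_frames:
            if abs(text_ts - av_ts) <= tolerance_s:
                match = text
                break
        # An AV frame with no matching caption is passed through uncaptioned.
        combined.append((av_ts, av_payload, match))
    return combined


captions = [(0.0, "Hello"), (10.0, "World")]
video = [(0.0, "seg0"), (10.02, "seg1"), (20.0, "seg2")]
print(combine(captions, video))
```

In this example the first two AV segments are paired with their captions, and the third, for which no caption carries a matching time stamp, is output without text.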

In method step 212, the combined text captioned AV signal that has been delayed is output and received by display 120, wherein further audio and video signal processing can occur prior to being displayed on a display and the audio broadcast by one or more loudspeakers.

FIG. 3 illustrates a block diagram of the major components of a personal computer (PC), server, laptop, personal electronic device (PED), personal digital assistant (PDA), tablet (e.g., iPad), or any other processing device/computer, such as AVD circuit 100 (hereinafter, “processing device 300”) suitable for use to implement method 200 among others, for generating captions for video and delaying the video to ensure substantially synchronized captions and video within AVD circuit 100 according to aspects of the embodiments.

Processing device 300 includes microprocessor 114, with memory 116, within which is stored AVDC App 118; in regard to FIG. 3, memory 116 can take the form of microprocessor internal memory 304, hard disk drive (HDD) 318, random access memory (RAM) 314, and read only memory (ROM) 312, as described in greater detail below.

Processing device 300 comprises, among other items, a shell/box 366, integrated display/touchscreen 330 (though not used in every application of the computer), internal data/command bus (bus) 308, printed circuit board (PCB) 316, and one or more processors 114, with processor internal memory 304 (which can be typically ROM and/or RAM). Those of ordinary skill in the art can appreciate that in modern computer systems, parallel processing is becoming increasingly prevalent, and whereas a single processor would have been used in the past to implement many or at least several functions, it is more common currently to have a single dedicated processor for certain functions (e.g., digital signal processors) and therefore there could be several processors, acting in serial and/or parallel, as required by the specific application. Processing device 300 further comprises multiple input/output ports, such as universal serial bus (USB) ports 320, Ethernet ports 322, and video graphics array (VGA) ports/high definition multimedia interface (HDMI) ports 324, among other types. Further, processing device 300 includes externally accessible drives such as compact disk (CD)/digital versatile disk (DVD) read/write (RW) (CD/DVD/RW) drive 326, and floppy diskette drive (FDD) 328 (though less used currently, some computers still include this type of interface). Processing device 300 still further includes wireless communication apparatus, such as one or more of the following: Wi-Fi transceiver 332, Bluetooth (BT) transceiver 334, near field communications (NFC) transceiver 336, third generation (3G)/fourth generation (4G)/long term evolution (LTE)/fifth generation (5G) transceiver (cellular transceiver) 338, communications satellite/global positioning system (satellite) transceiver 340, and antenna 364.

Internal memory that is located on PCB 316 itself can comprise HDD 318 (these can include conventional magnetic storage media, but, as is becoming increasingly prevalent, can include flash drive memory 354, among other types), ROM 312 (these can include electrically erasable programmable ROM (EEPROMs), ultra-violet erasable PROMs (UVPROMs), among other types), and RAM 314. Usable with USB port 320 is flash drive memory 354, and usable with CD/DVD/RW drive 326 are CD/DVD diskettes (CD/DVD) 356 (which can be both read and write-able). Usable with FDD 328 are floppy diskettes 358. External memory storage device 352 can be used to store data and programs external to processing device 300, and can itself comprise another HDD 318, flash drive memory 354, among other types of memory storage. External memory storage device 352 is connectable to processing device 300 via universal serial bus (USB) cable 346. Each of the memory storage devices, or the memory storage media (116, 318, 312, 314, 352, 354, 356, and 358, among others), can contain parts or components, or in its entirety, executable software programming code or application that has been termed AVDC App 118 according to aspects of the embodiments, which can implement part or all of the portions of method 200 among other methods not shown, described herein.

In addition to the above described components, processing device 300 also comprises keyboard 360, external display 120, printer/scanner/fax machine 344, and mouse 342 (although not technically part of the processing device 300, the peripheral components as shown in FIG. 3 (352, 120, 360, 342, 354, 356, 358, 346, 350, 344, and 348) are adapted for use with processing device 300 such that, for purposes of this discussion, they shall be considered as being part of the processing device 300). Other cable types that can be used with processing device 300 include RS 232, among others, not shown, that can be used for one or more of the connections between processing device 300 and the peripheral components described herein. Keyboard 360 and mouse 342 are connectable to processing device 300 via USB cable 346, and external display 120 is connectable to processing device 300 via VGA cable/HDMI cable 348. Processing device 300 is connectable to network 122 via Ethernet port 322 and Ethernet cable 350 via a router and modulator-demodulator (MODEM) and internet service provider, none of which are shown in FIG. 3. All of the immediately aforementioned components (324, 352, 120, 360, 342, 354, 356, 358, 346, 350, and 344) are known to those of ordinary skill in the art, and this description includes all known and future variants of these types of devices.

External display 120 can be any type of currently available display or presentation screen, such as liquid crystal displays (LCDs), light emitting diode displays (LEDs), plasma displays, cathode ray tubes (CRTs), among others (including touch screen displays). In addition to the user interface mechanism such as mouse 342, processing device 300 can further include a microphone, touch pad, joystick, touch screen, voice-recognition system, among other interactive inter-communicative devices/programs, which can be used to enter data and voice, and all of which are currently available and thus a detailed discussion thereof has been omitted in fulfillment of the dual purposes of clarity and brevity.

As mentioned above, processing device 300 further comprises a plurality of wireless transceiver devices, such as Wi-Fi transceiver 332, BT transceiver 334, NFC transceiver 336, cellular transceiver 338, satellite transceiver 340, and antenna 364. While each of Wi-Fi transceiver 332, BT transceiver 334, NFC transceiver 336, cellular transceiver 338, and satellite transceiver 340 has its own specialized functions, each can also be used for other types of communications, such as accessing a cellular service provider (not shown), accessing network 122 (which can include the Internet), texting, emailing, among other types of communications and data/voice transfers/exchanges, as known to those of skill in the art. Each of Wi-Fi transceiver 332, BT transceiver 334, NFC transceiver 336, cellular transceiver 338, and satellite transceiver 340 includes a transmitting and receiving device, and a specialized antenna, although in some instances, one antenna can be shared by one or more of Wi-Fi transceiver 332, BT transceiver 334, NFC transceiver 336, cellular transceiver 338, and satellite transceiver 340. Alternatively, one or more of Wi-Fi transceiver 332, BT transceiver 334, NFC transceiver 336, cellular transceiver 338, and satellite transceiver 340 will have a specialized antenna, such as satellite transceiver 340 to which is electrically connected at least one antenna 364.

In addition, processing device 300 can access network 122 (of which the Internet can be a part, as shown and described in FIG. 4 below), either through a hard wired connection such as Ethernet port 322 as described above, or wirelessly via Wi-Fi transceiver 332, cellular transceiver 338 and/or satellite transceiver 340 (and their respective antennas) according to aspects of the embodiments. Processing device 300 can also be part of a larger network configuration as in a GAN (e.g., internet), which ultimately allows connection to various landlines.

According to further aspects of the embodiments, integrated display/touchscreen 330, keyboard 360, mouse 342, and external display 120 (if in the form of a touch screen), can provide a means for a user to enter commands, data, digital, and analog information into the processing device 300. Integrated and external displays 330, 120 can be used to show visual representations of acquired data, and the status of applications that can be running, among other things.

Bus 308 provides a data/command pathway for items such as: the transfer and storage of data/commands between processor 114, Wi-Fi transceiver 332, BT transceiver 334, NFC transceiver 336, cellular transceiver 338, satellite transceiver 340, integrated display 330, USB port 320, Ethernet port 322, VGA/HDMI port 324, CD/DVD/RW drive 326, FDD 328, and processor internal memory 304. Through bus 308, data can be accessed that is stored in processor internal memory 304. Processor 114 can send information for visual display to either or both of integrated and external displays 330, 120, and the user can send commands to the computer operating system (OS) 306 that can reside in processor internal memory 304 of processor 114, or any of the other memory devices (356, 358, 318, 312, and 314).

Processing device 300, and either internal memories 304, 312, 314, and 318, or external memories 352, 354, 356 and 358, can be used to store computer code that when executed, implements method 200, as well as other methods not shown and discussed, for generating captions for video and delaying the video to ensure substantially synchronized captions and video, according to aspects of the embodiments. Hardware, firmware, software, or a combination thereof can be used to perform the various steps and operations described herein. According to aspects of the embodiments, AVDC App 118 for carrying out the above discussed steps can be stored and distributed on multi-media storage devices such as devices 318, 312, 314, 354, 356 and/or 358 (described above) or other form of media capable of portably storing information. Storage media 354, 356 and/or 358 can be inserted into, and read by devices such as USB port 320, CD/DVD/RW drive 326, and FDD 328, respectively.

As also will be appreciated by one skilled in the art, the various functional aspects of the embodiments can be embodied in a wireless communication device, a telecommunication network, or as a method or in a computer program product. Accordingly, aspects of embodiments can take the form of an entirely hardware embodiment or an embodiment combining hardware and software aspects. Further, the aspects of embodiments can take the form of a computer program product stored on a computer-readable storage medium having computer-readable instructions embodied in the medium. Any suitable computer-readable medium can be utilized, including hard disks, CD-ROMs, DVDs, optical storage devices, or magnetic storage devices such as a floppy disk or magnetic tape. Other non-limiting examples of computer-readable media include flash-type memories or other known types of memories.

Further, those of ordinary skill in the art in the field of the aspects of the embodiments can appreciate that such functionality can be designed into various types of circuitry, including, but not limited to field programmable gate array structures (FPGAs), application specific integrated circuitry (ASICs), microprocessor based systems, among other types. A detailed discussion of the various types of physical circuit implementations does not substantively aid in an understanding of the aspects of the embodiments, and as such has been omitted for the dual purposes of brevity and clarity. However, the systems and methods discussed herein can be implemented as discussed and can further include programmable devices.

Such programmable devices and/or other types of circuitry as previously discussed can include a processing unit, a system memory, and a system bus that couples various system components including the system memory to the processing unit. The system bus can be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. Furthermore, various types of computer readable media can be used to store programmable instructions. Computer readable media can be any available media that can be accessed by the processing unit. By way of example, and not limitation, computer readable media can comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile as well as removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROMs, DVDs or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information, and which can be accessed by the processing unit. Communication media can embody computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and can include any suitable information delivery media.

The system memory can include computer storage media in the form of volatile and/or nonvolatile memory such as ROM and/or RAM. A basic input/output system (BIOS), containing the basic routines that help to transfer information between elements connected to and between the processor, such as during start-up, can be stored in memory. The memory can also contain data and/or program modules that are immediately accessible to and/or presently being operated on by the processing unit. By way of non-limiting example, the memory can also include an operating system, application programs, other program modules, and program data.

The processor can also include other removable/non-removable and volatile/nonvolatile computer storage media. For example, the processor can access a hard disk drive that reads from or writes to non-removable, nonvolatile magnetic media, a magnetic disk drive that reads from or writes to a removable, nonvolatile magnetic disk, and/or an optical disk drive that reads from or writes to a removable, nonvolatile optical disk, such as a CD-ROM or other optical media. Other removable/non-removable, volatile/nonvolatile computer storage media that can be used in the operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid state RAM, solid state ROM and the like. A hard disk drive can be connected to the system bus through a non-removable memory interface, and a magnetic disk drive or optical disk drive can be connected to the system bus by a removable memory interface.

Aspects of the embodiments discussed herein can also be embodied as computer-readable codes on a computer-readable medium. The computer-readable medium can include a computer-readable recording medium and a computer-readable transmission medium. The computer-readable recording medium is any data storage device that can store data which can be thereafter read by a computer system. Examples of the computer-readable recording medium include ROM, RAM, CD-ROMs and generally optical data storage devices, magnetic tapes, flash drives, and floppy disks. The computer-readable recording medium can also be distributed over network coupled computer systems so that the computer-readable code is stored and executed in a distributed fashion. The computer-readable transmission medium can transmit carrier waves or signals (e.g., wired, or wireless data transmission through the Internet). Also, functional programs, codes, and code segments to, when implemented in suitable electronic hardware, accomplish or support exercising certain elements of the appended claims can be readily construed by programmers skilled in the art to which the aspects of the embodiments pertain.

The disclosed aspects of the embodiments provide a system and method for generating captions for video and delaying the video to ensure substantially synchronized captions and video within AVD circuit 100, according to aspects of the embodiments, on one or more computers or processing devices 300. It should be understood that this description is not intended to limit aspects of the embodiments. On the contrary, aspects of the embodiments are intended to cover alternatives, modifications, and equivalents, which are included in the spirit and scope of the aspects of the embodiments as defined by the appended claims. Further, in the detailed description of the aspects of the embodiments, numerous specific details are set forth to provide a comprehensive understanding of the claimed aspects of the embodiments. However, one skilled in the art would understand that various aspects of the embodiments can be practiced without such specific details.

FIG. 4 illustrates network system 122 within which the system and method for generating captions for video and delaying the video to ensure substantially synchronized captions and video within AVD circuit 100 can be used, according to aspects of the embodiments. Much of the infrastructure of network system 122 shown in FIG. 4 is or should be known to those of skill in the art, so, in fulfillment of the dual purposes of clarity and brevity, a detailed discussion thereof shall be omitted.

According to aspects of the embodiments, a user of the above described system and method can store AVDC App 118 on their processing device 300 as well as mobile electronic device (MED)/PED 422 (hereinafter referred to as “PEDs 422”). PEDs 422 can include, but are not limited to, so-called smart phones, tablets, personal digital assistants (PDAs), notebook and laptop computers, and essentially any device that can access the internet and/or cellular phone service or can facilitate transfer of the same type of data in either a wired or wireless manner.

PED 422 can access cellular service provider 412, either through a wireless connection (cell tower 414) or via a wireless/wired interconnection (a “Wi-Fi” system that comprises, e.g., modem 402, wireless router 404, internet service provider (ISP) 406, and internet 410 (although not shown, those of skill in the art can appreciate that internet 410 comprises various different types of communications cables, servers/routers/switches 408, and the like, wherein data/software/applications of all types are stored in memory within or attached to servers or other processor based electronic devices, including, for example, AVDC App 118 within a computer/server that can be accessed by a user of AVDC App 118 on their PED 422 and/or processing device 300)). As those of skill in the art can further appreciate, internet 410 can include access to “cloud” computing service(s) and devices, wherein the cloud refers to the on-demand availability of computer system resources, especially data storage and computing power, without direct active management by the user. Large clouds often have functions distributed over multiple locations, each location being a data center.

Further, PED 422 can include NFC, “Wi-Fi,” and Bluetooth (BT) communications capabilities as well, all of which are known to those of skill in the art. To that end, network system 122 further includes, as many homes (and businesses) do, one or more computers or processing devices 300 that can be connected to wireless router 404 via a wired connection (e.g., modem 402) or via a wireless connection (e.g., Bluetooth). Modem 402 can be connected to ISP 406 to provide internet-based communications in the appropriate format to end users (e.g., processing device 300), and which takes signals from the end users and forwards them to ISP 406.

PEDs 422 can also access global positioning system (GPS) satellite 420, which is controlled by GPS station 418, to obtain positioning information (which can be useful for different aspects of the embodiments), or PEDs 422 can obtain positioning information via cellular service provider 412 using cellular tower(s) (cell tower) 414 according to one or more methods of position determination. Some PEDs 422 can also access communication satellites 420 and their respective satellite communication systems control stations 416 (the satellite in FIG. 4 is shown common to both communications and GPS functions) for near-universal communications capabilities, albeit at a much higher cost than conventional “terrestrial” cellular services. PEDs 422 can also obtain positioning information when near or internal to a building (or arena/stadium) through the use of one or more of NFC/BT devices. FIG. 4 also illustrates other components of network 122 such as plain old telephone service (POTS) provider 424.

According to further aspects of the embodiments, and as described above, network 122 also contains other types of servers/devices that can include processing device 300, wherein one or more processors, using currently available technology, such as memory, data and instruction buses, and other electronic devices, can store and implement code that can implement the system and method for generating captions for video and delaying the video to ensure substantially synchronized captions and video within AVD circuit 100, according to aspects of the embodiments.

According to further aspects of the embodiments, additional features and functions of inventive embodiments are described herein below, wherein such descriptions are to be viewed in light of the above noted detailed embodiments as understood by those skilled in the art.

As described above, an encoding process is discussed specifically in reference to FIG. 2, although such delineation is not meant to be, and should not be taken in a limiting manner, as additional methods according to aspects of the embodiments have been described herein. The encoding processes as described are not meant to limit the aspects of the embodiments, or to suggest that the aspects of the embodiments should be implemented following the encoding processes. The purpose of the encoding processes as described is to facilitate the understanding of one or more aspects of the embodiments and to provide the reader with one or many possible implementations of the processes discussed herein. FIG. 2 illustrates a flowchart of various steps performed during the encoding process, but such encoding processes are not limited thereto. The steps of FIG. 2 are not intended to completely describe the encoding processes but only to illustrate some of the aspects discussed above.

This application may contain material that is subject to copyright, mask work, and/or other intellectual property protection. The respective owners of such intellectual property have no objection to the facsimile reproduction of the disclosure by anyone as it appears in published Patent Office file/records, but otherwise reserve all rights.

It should be understood that this description is not intended to limit the embodiments. On the contrary, the embodiments are intended to cover alternatives, modifications, and equivalents, which are included in the spirit and scope of the embodiments as defined by the appended claims. Further, in the detailed description of the embodiments, numerous specific details are set forth to provide a comprehensive understanding of the claimed embodiments. However, one skilled in the art would understand that various embodiments may be practiced without such specific details.

Although the features and elements of aspects of the embodiments are described being in particular combinations, each feature or element can be used alone, without the other features and elements of the embodiments, or in various combinations with or without other features and elements disclosed herein.

This written description uses examples of the subject matter disclosed to enable any person skilled in the art to practice the same, including making and using any devices or systems and performing any incorporated methods. The patentable scope of the subject matter is defined by the claims, and may include other examples that occur to those skilled in the art. Such other examples are intended to be within the scope of the claims.

The above-described embodiments are intended to be illustrative in all respects, rather than restrictive, of the embodiments. Thus, the embodiments are capable of many variations in detailed implementation that can be derived from the description contained herein by a person skilled in the art. No element, act, or instruction used in the description of the present application should be construed as critical or essential to the embodiments unless explicitly described as such. Also, as used herein, the article “a” is intended to include one or more items.

All United States patents and applications, foreign patents, and publications discussed above are hereby incorporated herein by reference in their entireties.

Industrial Applicability

To solve the aforementioned problems, the aspects of the embodiments are directed towards systems, methods, and modes for receiving an audio-video signal, extracting audio from an audio-video signal, time stamping both the extracted audio and audio-video signal, generating captions from the extracted audio and converting the text to a video text signal, delaying the video for a duration substantially equal to the time it takes to generate the captions from the extracted audio to ensure that the captions are substantially synchronized with video when recombined, recombining the video text signal and delayed audio-video signal based on their respective time stamps, and displaying the recombined audio-video signal and video text signal.

Alternate Embodiments

Alternate embodiments may be devised without departing from the spirit or the scope of the different aspects of the embodiments.

What is claimed is:
 1. A method for generating text caption informationfor an audio-video (AV) signal, the method comprising: receiving an AVsignal; extracting audio from the AV signal to form an extracted audiosignal; time stamping both the extracted audio signal and the receivedAV signal; partitioning the extracted audio signal into a firstpredetermined duration segment of extracted audio signal; generatingtext captions from the partitioned extracted audio signal over a firstduration, and converting the same to a video text signal, with the sametime stamp as the extracted audio signal and received AV signal;delaying the received AV signal by an amount of time substantiallysimilar to the first duration; combining the time stamped video textsignal and the delayed time stamped received AV signal based on the timestamps; and outputting the combined time stamped video text signal andthe time stamped received AV signal to a display.
 2. The method according to claim 1, wherein the step of generating text captions from the partitioned extracted audio signal over a first duration further comprises: comparing the generated text captions with a list of text obtained by a source of the AV signal to improve accuracy of the generated text captions.
 3. The method according to claim 2, wherein the list of text obtained by the source of the AV signal comprises text associated with the subject matter of the AV signal.
 4. The method according to claim 1, wherein the step of generating text captions from the partitioned extracted audio signal over a first duration further comprises: obtaining metadata from the AV signal; generating a list of text that substantially matches the subject matter of the AV signal based on the obtained metadata; comparing the generated text captions with the generated list of text based on the obtained metadata to improve accuracy of the generated text captions.
 5. The method according to claim 1, wherein the step of generating text captions from the partitioned extracted audio signal over a first duration further comprises: using artificial intelligence programming techniques to develop a list of text that substantially matches the subject matter of the AV signal based on the obtained metadata; comparing the generated text captions with the AI developed list of text to improve accuracy of the generated text captions.
 6. The method according to claim 5, wherein the AI techniques comprise: Recurrent Neural Networks that are trained to suppress non-voice audio resulting in significantly improved voice signal-to-noise ratio (SNR) and clarity.
 7. A system for generating text caption information for an audio-video (AV) signal system, comprising: an audio-video (AV) signal receiver; at least one processor that is part of the AV signal receiver; a memory operatively connected with the at least one processor, wherein the memory stores computer-executable instructions that, when executed by the at least one processor, cause the at least one processor to execute a method that comprises: receiving an AV signal at the AV signal receiver; extracting audio from the AV signal to form an extracted audio signal; time stamping both the extracted audio signal and the received AV signal; partitioning the extracted audio signal into a first predetermined duration segment of extracted audio signal; generating text captions from the partitioned extracted audio signal over a first duration, and converting the same to a video text signal, with the same time stamp as the extracted audio signal and received AV signal; delaying the received AV signal by an amount of time substantially similar to the first duration; combining the time stamped video text signal and the delayed time stamped received AV signal based on the time stamps; and outputting the combined time stamped video text signal and the time stamped received AV signal to a display.
 8. The system according to claim 7, wherein the step of generating text captions from the partitioned extracted audio signal over a first duration further comprises: comparing the generated text captions with a list of text obtained by a source of the AV signal to improve accuracy of the generated text captions.
 9. The system according to claim 8, wherein the list of text obtained by the source of the AV signal comprises text associated with the subject matter of the AV signal.
 10. The system according to claim 7, wherein the step of generating text captions from the partitioned extracted audio signal over a first duration further comprises: obtaining metadata from the AV signal; generating a list of text that substantially matches the subject matter of the AV signal based on the obtained metadata; comparing the generated text captions with the generated list of text based on the obtained metadata to improve accuracy of the generated text captions.
 11. The system according to claim 7, wherein the step of generating text captions from the partitioned extracted audio signal over a first duration further comprises: using artificial intelligence programming techniques to develop a list of text that substantially matches the subject matter of the AV signal based on the obtained metadata; comparing the generated text captions with the AI developed list of text to improve accuracy of the generated text captions.
 12. The system according to claim 11, wherein the AI programming techniques comprise: Recurrent Neural Networks that are trained to suppress non-voice audio resulting in significantly improved voice signal-to-noise ratio (SNR) and clarity.
 13. The system according to claim 7, wherein the AV signal receiver, at least one processor and memory are part of an audio video display device.
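
By way of non-limiting illustration, the comparison of generated text captions with a list of text, as recited in claims 2 through 5 and 8 through 11, could be sketched as follows. The claims do not specify how the comparison is performed; fuzzy string matching with Python's standard `difflib` module is an assumption made here for illustration only, and the function name `correct_caption`, the `cutoff` value, and the example word list are hypothetical.

```python
import difflib
from typing import Iterable


def correct_caption(caption: str, word_list: Iterable[str], cutoff: float = 0.8) -> str:
    """Replace caption words that closely match an entry in the list of text.

    Assumption: a word is corrected only when its similarity to a list
    entry meets the cutoff; all other words are left unchanged.
    """
    # Map lowercase forms back to the spelling used in the list of text.
    lookup = {entry.lower(): entry for entry in word_list}
    candidates = list(lookup)
    corrected = []
    for word in caption.split():
        matches = difflib.get_close_matches(word.lower(), candidates, n=1, cutoff=cutoff)
        corrected.append(lookup[matches[0]] if matches else word)
    return " ".join(corrected)


# Usage with a hypothetical list of text for a football broadcast.
print(correct_caption("the quarterbak scores a tuchdown", ["quarterback", "touchdown"]))
```

In this sketch a higher `cutoff` makes corrections more conservative, which reduces the chance of replacing a correctly transcribed word with an unrelated entry from the list.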